Online Active Learning Methods for Fast Label-Efficient Spam Filtering

نویسنده

  • D. Sculley
چکیده

Active learning methods seek to reduce the number of labeled examples needed to train an effective classifier, and have natural appeal in spam filtering applications where trustworthy labels for messages may be costly to acquire. Past investigations of active learning in spam filtering have focused on the pool-based scenario, where there is assumed to be a large, unlabeled data set and the goal is to iteratively identify the best subset of examples for which to request labels. However, even with optimizations this is a costly approach. We investigate an online active learning scenario where the filter is exposed to a stream of messages which must be classified one at a time. The filter may only request a label for a given message immediately after it has been classified. The goal is to achieve strong online classification performance with few label requests. This is a novel scenario for low-cost active spam filtering, fitting for application in large-scale systems. We draw from the label efficient machine learning literature to investigate several approaches to selective sampling in this scenario using linear classifiers. We show that online active learning can dramatically reduce labeling and training costs with negligible additional overhead while maintaining high levels of classification performance.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Advances in Online Learning-based Spam Filtering

The low cost of digital communication has given rise to the problem of email spam, which is unwanted, harmful, or abusive electronic content. In this thesis, we present several advances in the application of online machine learning methods for automatically filtering spam. We detail a sliding-window variant of Support Vector Machines that yields state of the art results for the standard online ...

متن کامل

Active Multi-Field Learning for Spam Filtering

Ubiquitous spam messages cause a serious waste of time and resources. This paper addresses the practical spam filtering problem, and proposes a universal approach to fight with various spam messages. The proposed active multi-field learning approach is based on: 1) It is cost-sensitive to obtain a label for a realworld spam filter, which suggests an active learning idea; and 2) Different messag...

متن کامل

Spam Filtering Using Statistical Data Compression Models

Spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. Since spam evolves continuously and most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. In this paper, we investigate a novel app...

متن کامل

Active Learning Image Spam Hunter

Image spam is annoying email users around the world. Most previous work for image spam detection focuses on supervised learning approaches. However, it is costly to get enough trustworthy labels for learning, especially for an adversarial problem where spammers constantly modify patterns to evade the classifier. To address this issue, we employ the principle of active learning where the learner...

متن کامل

Relaxed Online SVMs in the TREC Spam Filtering Track

Relaxed Online Support Vector Machines (ROSVMs) have recently been proposed as an efficient methodology for attaining an approximate SVM solution for streaming data such as the online spam filtering task. Here, we apply ROSVMs in the TREC 2007 Spam filtering track and report results. In particular, we explore the effect of various slidingwindow sizes, trading off computation cost against classi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007